TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection
Thanks to recent advancements in end-to-end speech modeling, it has become
increasingly feasible to imitate and clone a user's voice, which makes it
significantly harder to distinguish authentic audio segments from fabricated
ones. To address the abuse and misuse of users' voices,
the second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and
analyze deepfake speech utterances. Specifically, Track 2, named Manipulation
Region Location (RL), aims to pinpoint the locations of manipulated
regions in audio, which can be present in both real and generated audio
segments. We propose our novel TranssionADD system to address the challenges
of model robustness and audio segment outliers in this track. Our system makes
three distinct contributions: 1) we adapt the sequence tagging task for audio
deepfake detection; 2) we improve model generalization through various data
augmentation techniques; 3) we incorporate a multi-frame detection (MFD)
module to overcome the limited representation provided by a single frame, and
use an isolated-frame penalty (IFP) loss to handle outliers in segments. Our
best submission achieved 2nd place in Track 2, demonstrating the effectiveness
and robustness of our proposed system.
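The abstract does not spell out the IFP formulation. As a purely illustrative
sketch, one way to realize an isolated-frame penalty in PyTorch is to up-weight
the per-frame loss wherever the predicted tag disagrees with both neighboring
frames; the function name, penalty weight, and label scheme below are
assumptions, not the authors' released code.

    # Hypothetical sketch of an isolated-frame penalty (IFP) term.
    # Assumes frame-level binary tags: 0 = genuine, 1 = manipulated.
    import torch
    import torch.nn.functional as F

    def ifp_loss(logits: torch.Tensor, labels: torch.Tensor,
                 penalty: float = 0.5) -> torch.Tensor:
        """logits: (batch, frames, 2) frame-level scores;
        labels: (batch, frames) ground-truth frame tags."""
        # Per-frame cross entropy, kept unreduced so it can be reweighted.
        ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
        preds = logits.argmax(dim=-1)
        # A frame is "isolated" if its prediction differs from both neighbors.
        left, mid, right = preds[:, :-2], preds[:, 1:-1], preds[:, 2:]
        isolated = (mid != left) & (mid != right)
        weights = torch.ones_like(ce)
        weights[:, 1:-1] += penalty * isolated.float()  # up-weight outliers
        return (weights * ce).mean()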
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
In the realm of expressive Text-to-Speech (TTS), explicit prosodic boundaries
significantly advance the naturalness and controllability of synthesized
speech. While human prosody annotation contributes substantially to
performance, it is a labor-intensive and time-consuming process that often
yields inconsistent outcomes. Even with extensive supervised data available,
the current benchmark model still falls short in performance. To address this
issue, this paper proposes a novel two-stage automatic annotation pipeline.
Specifically, in the first stage, we propose contrastive text-speech
pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs. The
pretraining procedure aims to enhance the prosodic representation extracted
from the joint text-speech space. In the second stage, we build a multi-modal prosody
annotator, which consists of pretrained encoders, a straightforward yet
effective text-speech feature fusion scheme, and a sequence classifier.
Extensive experiments conclusively demonstrate that our proposed method excels
at automatically generating prosody annotation and achieves state-of-the-art
(SOTA) performance. Furthermore, our model exhibits remarkable
resilience when tested with varying amounts of data.
Comment: Submitted to ICASSP 202
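As a rough illustration of what the stage-one contrastive text-speech
pretraining could look like, below is a symmetric InfoNCE-style objective over
batch-aligned speech (Speech-Silence) and text (Word-Punctuation) embeddings.
The function name, embedding dimensions, and temperature are assumptions for
the sketch, not the paper's specification.

    # Hypothetical sketch of contrastive pretraining over SSWP pairs;
    # matching rows of the two batches are treated as positive pairs.
    import torch
    import torch.nn.functional as F

    def sswp_contrastive_loss(speech_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        """speech_emb, text_emb: (batch, dim) embeddings of aligned
        Speech-Silence and Word-Punctuation pairs."""
        s = F.normalize(speech_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = s @ t.T / temperature      # (batch, batch) similarities
        targets = torch.arange(s.size(0), device=s.device)
        # Symmetric InfoNCE: speech-to-text and text-to-speech directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))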